Training Data Augmentation Technique for Machine Comprehension by Question-Answer Pairs Generation Models based on a Pretrained Encoder-Decoder Model

Hyeonho Shin (신현호); Sung-Pil Choi (최성필)

정보과학회논문지 (Journal of KIISE)

Korean Title	사전 학습된 Encoder-Decoder 모델 기반 질의응답 쌍 생성을 통한 기계 독해 학습 데이터 증강 기법
English Title	Training Data Augmentation Technique for Machine Comprehension by Question-Answer Pairs Generation Models based on a Pretrained Encoder-Decoder Model
Authors	Hyeonho Shin (신현호), Sung-Pil Choi (최성필)
Citation	Vol. 49, No. 2, pp. 166-175 (Feb. 2022)
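Korean Abstract (English translation)	Machine reading comprehension (MRC) research aims to find the answer to a question within a document. It requires large-scale data, which individual researchers and small research institutes have limited capacity to build. This paper therefore proposes an MRC data augmentation technique that uses pre-trained language models. The technique consists of a question-answer pair generation model and a data validation model. The question-answer pair generation model comprises an answer extraction model and a question generation model, both built by fine-tuning the BART model. The data validation model is added separately to increase the reliability of the augmented data, and it decides whether the augmented data is used. The validation model is an ELECTRA model fine-tuned as an MRC model. To confirm that the technique improves model performance, we applied it to the KorQuAD v1.0 data. In the experiments, the EM score rose by up to 7.2 and the F1 score by up to 5.7 relative to the existing model, a meaningful improvement.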
English Abstract	The goal of Machine Reading Comprehension (MRC) research is to find answers to questions in documents. MRC research requires large-scale, high-quality data, but individual researchers and small research institutes are limited in their capacity to construct it. To overcome this limitation, we propose an MRC data augmentation technique that uses pre-trained language models. The technique consists of a question-answer pair generation model and a data validation model. The question-answer pair generation model consists of an answer extraction model and a question generation model, both constructed by fine-tuning the BART model. The data validation model is added to increase the reliability of the augmented data: it verifies each generated example and determines whether it is used. The validation model is an ELECTRA model fine-tuned as an MRC model. To measure the performance improvement of the MRC model, we applied the data augmentation technique to the KorQuAD v1.0 data. In the experiments, the Exact Match (EM) score increased by up to 7.2 and the F1 score by up to 5.7 compared to the model trained without augmentation.
Keywords	data augmentation; machine reading comprehension; natural language processing; question generation; answer extraction
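
The pipeline described in the abstract (answer extraction, then question generation, then validation by an MRC model) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The three models are passed in as plain callables (`extract_answers`, `generate_question`, and `read`, all hypothetical names) standing in for the fine-tuned BART and ELECTRA models. The acceptance rule, which keeps a pair only when the validator's predicted answer matches the extracted answer with F1 at or above `min_f1`, is an assumed criterion, since the abstract only states that the validation model decides whether augmented data is used. The `char_f1` helper is a simplified character-overlap score, not the official KorQuAD evaluation script.

```python
from collections import Counter
from typing import Callable, Iterable, List, NamedTuple


class QAPair(NamedTuple):
    context: str
    question: str
    answer: str


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the strings are equal after collapsing whitespace."""
    return float(" ".join(prediction.split()) == " ".join(reference.split()))


def char_f1(prediction: str, reference: str) -> float:
    """Character-overlap F1 ignoring whitespace (simplified; an assumption)."""
    pred = [c for c in prediction if not c.isspace()]
    ref = [c for c in reference if not c.isspace()]
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def augment(
    contexts: Iterable[str],
    extract_answers: Callable[[str], List[str]],   # stand-in for the answer extraction model
    generate_question: Callable[[str, str], str],  # stand-in for the question generation model
    read: Callable[[str, str], str],               # stand-in for the MRC validation model
    min_f1: float = 1.0,                           # assumed acceptance threshold
) -> List[QAPair]:
    """Generate question-answer pairs and keep only those the validator confirms."""
    kept: List[QAPair] = []
    for context in contexts:
        for answer in extract_answers(context):
            question = generate_question(context, answer)
            predicted = read(context, question)
            # Assumed rule: accept when the validator recovers the extracted answer.
            if char_f1(predicted, answer) >= min_f1:
                kept.append(QAPair(context, question, answer))
    return kept
```

With real models, each callable would wrap one model's inference call. Lowering `min_f1` relaxes the filter and admits more augmented pairs at the cost of noisier data.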